Author: Ijeoma Nwachukwu
Date: 2025-08-26
The Biostatistics and Health Data Science Group is a multidisciplinary academic research and teaching group under the IAHS, characterized by collaborative research, consultancy and training across clinical, biological and global health domains. In the global health domain, to which I was assigned, the data used for research and training are collected from a number of secure sources, including the DHS Program.
The DHS Program, funded by USAID, collects nationally representative global health data to monitor and evaluate population, health, and nutrition programs. It provides data for tracking approximately 30 SDG indicators, as well as the measures used to track them, contributing significantly towards achieving SDGs 3 and 5 (The DHS Program, 2025).
However, the DHS Program has been suspended and is currently under review for further funding. During this review, new registrations are not being accepted, which restricts access to datasets commonly used by undergraduate and postgraduate students for their theses and training, especially in LMICs. This significantly hampers the training of future national and global health leaders, in addition to other far-reaching effects.
My project focused on collecting, organizing, merging and analyzing datasets from the DHS Program relevant to our global health projects. While this mitigates the effect of the recent suspension of the DHS Program on students and researchers within the team working on global health projects, it also gave me an opportunity to familiarize myself with global health data and to perform exploratory data analysis on aspects of gender inequality, including female genital mutilation (FGM), intimate partner violence (IPV) and autonomy in health care decision-making (AHCDM), which are often intertwined and are prevalent issues for women of childbearing age in LMICs (Wessells & Kostelny, 2022).
This project achieved two aims
The project was carried out in three phases, and each phase was documented to ensure transparency and reproducibility of the workflow and analysis results.
The data were obtained from the DHS Program website using my supervisor’s login. To access datasets, new users must register for an account on the DHS Program website.
A structured, reproducible workflow was scripted in R Markdown. It
serves as a comprehensive toolkit for accessing, processing, and
locally managing DHS downloads, enabling seamless data retrieval for
collaborative research in support of global health studies. It ensures
secure data access, automates downloads, and systematically unzips,
organizes and saves the datasets in a hierarchical file structure:
FileName/CountryName/SurveyYear/DataType. The workflow is specific to
DHS datasets in SPSS and Stata formats, as specified in my project
tasks.
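The folder layout described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the project's actual script: the helper names `dhs_target_dir()` and `unzip_to_hierarchy()`, the root folder name and the example country, year and data type are all hypothetical.

```r
# Minimal sketch with assumed helper names (not the project's actual script).
# Build the target folder Root/CountryName/SurveyYear/DataType.
dhs_target_dir <- function(root, country, year, data_type) {
  file.path(root, country, as.character(year), data_type)
}

# Unzip one downloaded archive into its place in the hierarchy.
unzip_to_hierarchy <- function(zip_path, root, country, year, data_type) {
  target <- dhs_target_dir(root, country, year, data_type)
  dir.create(target, recursive = TRUE, showWarnings = FALSE)
  utils::unzip(zip_path, exdir = target)
  invisible(target)
}

# Example with hypothetical values: where a Kenya 2022 SPSS file would go.
example_dir <- dhs_target_dir("DHS_Downloads", "Kenya", 2022, "SPSS")
print(example_dir)
```

Keeping the path construction in one small function means the same rule can be reused for every country, survey year and file format, which is what makes the local archive predictable for collaborators.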
A structured, reproducible workflow was developed to merge the DHS
Individual Recode (IR) datasets for two countries (Kenya 2022 and
Tanzania 2022) using SPSS syntax. A cross-country unique identifier,
UCASEID, was created by concatenating the country code variable (V000)
and the case ID (CASEID). Subsets containing the UCASEID and relevant
IPV variables were saved and merged using SPSS commands. This workflow
can be adapted for additional countries and survey rounds, and
replicated for different variables, provided that the variable names,
labels, and meanings are first confirmed to be consistent according to
the DHS Recode Manual (The DHS Program, 2025). See the workflow syntax
in Appendix 1.
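For readers who work in R rather than SPSS, the same idea (build a cross-country identifier, then stack the cases) might look roughly like the sketch below. It uses two tiny made-up data frames in place of the real IR files; the helper name `add_ucaseid()` and all values are hypothetical, and the project itself performed this step in SPSS.

```r
# Toy stand-ins for the two IR subsets (values are made up for illustration).
ke <- data.frame(V000 = "KE8", CASEID = c("  1  1  2", "  1  4  1"),
                 V005 = c(1200000, 950000), stringsAsFactors = FALSE)
tz <- data.frame(V000 = "TZ8", CASEID = c("  1  1  2", "  2  3  1"),
                 V005 = c(800000, 1100000), stringsAsFactors = FALSE)

# Confirm both files carry the same variable names before stacking them.
stopifnot(identical(names(ke), names(tz)))

# Hypothetical helper: country code followed by case ID.
add_ucaseid <- function(df) {
  df$UCASEID <- paste0(df$V000, df$CASEID)
  df
}

# Stack the cases (the analogue of SPSS ADD FILES).
merged <- rbind(add_ucaseid(ke), add_ucaseid(tz))

# The same CASEID occurs in both countries, but UCASEID stays unique.
stopifnot(!anyDuplicated(merged$UCASEID))
print(merged[, c("UCASEID", "V000", "V005")])
```

The duplicate check at the end is the reason for prefixing the country code: case IDs are only unique within a survey, so they can collide once files from several countries are appended.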
Exploratory data analysis was carried out on datasets from the IPUMS-DHS website, which provides DHS survey data harmonized across countries and over time. Initial cross-tabulations of key variables were produced in SPSS; the SPSS output tables were cleaned in Excel, with the steps documented (see Appendix 2), and the results were imported into R to create interactive plots.
library(plotly)
library(dplyr)
library(readxl)
library(tidyr)
library(forcats)
# Read data: load the "ipv" sheet of the SPSS output workbook
ipv_fgm_ahcdm_spss_output <- read_excel("ipv-fgm-ahcdm-spss-output.xlsx",
                                        sheet = "ipv")
# Clean data
# List of sheet names
sheet_names <- c("ipv", "fgm", "ahcdm")
# Define a cleaning function
clean_spss_output <- function(df) {
  df %>%
    # Remove the first 3 rows and the last row
    slice(-(1:3), -n()) %>%
    # Remove the first and last columns
    select(-1, -last_col()) %>%
    # Rename the first remaining column to 'Country'
    rename(Country = 1) %>%
    # Convert every column except Country to numeric
    mutate(across(-Country, as.numeric))
}
# Read and clean each sheet
cleaned_data <- lapply(sheet_names, function(sheet) {
  df_raw <- read_excel("ipv-fgm-ahcdm-spss-output.xlsx", sheet = sheet)
  clean_spss_output(df_raw)
})
# Assign cleaned data to named objects for easy access
names(cleaned_data) <- sheet_names
ipv_clean <- cleaned_data$ipv
fgm_clean <- cleaned_data$fgm
ahcdm_clean <- cleaned_data$ahcdm
# Rename category columns and delete Missing, NIU and Total values
df_ipv <- ipv_clean %>%
  rename(
    Country = 1,
    `Not ever slapped` = 2,
    `Often during last 12 months` = 3,
    `Sometimes during last 12 months` = 4,
    `Not at all in last 12 months` = 5,
    `Yes, timing and frequency unknown` = 6
  )
# Drop the leftover unnamed columns (...8 to ...10) if present
df_ipv <- df_ipv %>% select(-any_of(c("...8", "...9", "...10")))
df_fgm <- fgm_clean %>%
  rename(
    Country = 1,
    `no` = 2,
    `yes` = 3,
    `don't know` = 4
  )
# Drop the leftover unnamed columns (...6 to ...8) if present
df_fgm <- df_fgm %>% select(-any_of(c("...6", "...7", "...8")))
df_ahcdm <- ahcdm_clean %>%
  rename(
    Country = 1,
    `Woman alone` = 2,
    `Woman and husband/partner` = 3,
    `Woman and someone else` = 4,
    `Husband/partner` = 5,
    `Family elders/relatives` = 8
  )
# Drop the leftover unnamed columns (...7 to ...11) if present
df_ahcdm <- df_ahcdm %>%
  select(-any_of(c("...7", "...8", "...9", "...10", "...11")))
response_ipv <- c(
  "Not ever slapped",
  "Often during last 12 months",
  "Sometimes during last 12 months",
  "Not at all in last 12 months",
  "Yes, timing and frequency unknown"
)
# Calculate the row total across the response categories
# (Missing and NIU are not part of the total)
df_ipv <- df_ipv %>%
  rowwise() %>%
  mutate(Total = sum(c_across(all_of(response_ipv)))) %>%
  ungroup()
# Calculate percentages
for (col in response_ipv) {
  df_ipv[[paste0(col, " %")]] <- 100 * df_ipv[[col]] / df_ipv$Total
}
# Reshape to long format for plotting
df_ipv_long <- df_ipv %>%
  select(Country, ends_with("%")) %>%
  pivot_longer(
    cols = -Country,
    names_to = "Response",
    values_to = "Percent"
  ) %>%
  mutate(Response = gsub(" %", "", Response)) # Clean up response label
# Generate interactive plot using plotly
fig_ipv <- plot_ly(
  df_ipv_long,
  y = ~Country,
  x = ~Percent,
  color = ~Response,
  type = "bar",
  orientation = "h"
) %>%
  layout(
    barmode = "stack",
    title = "Women Slapped in Last 12 Months",
    yaxis = list(title = "Country", categoryorder = "total ascending"),
    xaxis = list(title = "Percent of Respondents"),
    legend = list(title = list(text = "Response Category"))
  )
fig_ipv
#save_plotly_screenshot(fig1_ipv, "fig1_ipv.png")
#knitr::include_graphics("fig1_ipv.png")
#knitr::include_graphics("fig2_ipv.png")
This plot presents the distribution of women’s reported experiences with intimate partner violence (IPV) across countries. The response categories include: “Not ever slapped,” “Often during last 12 months,” “Sometimes during last 12 months,” “Not at all in last 12 months,” and “Yes, timing and frequency unknown.” Most countries show that a significant proportion of women have never been slapped by an intimate partner, but in many settings, notable percentages report being slapped at least sometimes or often within the past year. Variability across countries is visible, with some (e.g., Sao Tome and Principe, Zimbabwe) having higher frequencies of violence, and others (e.g., India, Senegal) showing larger shares of respondents reporting no experience of IPV.
fgm_response_cols <- c(
  "no",
  "yes",
  "don't know"
)
# Calculate the row total across the response categories
# (Missing and NIU are not part of the total)
df_fgm <- df_fgm %>%
  rowwise() %>%
  mutate(Total = sum(c_across(all_of(fgm_response_cols)))) %>%
  ungroup()
# Calculate percentages
for (col in fgm_response_cols) {
  df_fgm[[paste0(col, " %")]] <- 100 * df_fgm[[col]] / df_fgm$Total
}
# Reshape to long format for plotting
df_fgm_long <- df_fgm %>%
  select(Country, ends_with("%")) %>%
  pivot_longer(
    cols = -Country,
    names_to = "Response",
    values_to = "Percent"
  ) %>%
  mutate(Response = gsub(" %", "", Response)) # Clean up response label
# Generate interactive plot using plotly
fig_fgm <- plot_ly(
  df_fgm_long,
  y = ~Country,
  x = ~Percent,
  color = ~Response,
  type = "bar",
  orientation = "h"
) %>%
  layout(
    barmode = "stack",
    title = "Percentage of Women Ever Circumcised",
    xaxis = list(title = " "),
    yaxis = list(title = " "),
    legend = list(title = list(text = "Response Category"))
  )
fig_fgm
#save_plotly_screenshot(fig_fgm, "fig_fgm.png")
#knitr::include_graphics("fig3_fgm.png")
This plot presents the percentage of women who report having experienced female genital mutilation/cutting (FGM/C) in each country, with responses categorized as “yes,” “no,” and “don’t know.” There is wide variation between countries: Guinea, Sierra Leone, Mali, Gambia, and Egypt show extremely high percentages of women reporting being circumcised (often over 80%), while countries such as Ghana, Cameroon, and Tanzania report relatively low percentages. The “don’t know” response is almost negligible in most contexts. The significant country-to-country differences reflect varying cultural, legal, and historical norms around FGM/C practices.
ahcdm_response_col <- c(
  "Woman alone",
  "Woman and husband/partner",
  "Woman and someone else",
  "Husband/partner",
  "Family elders/relatives"
)
# Calculate the row total across the response categories
# (Missing and NIU are not part of the total)
df_ahcdm <- df_ahcdm %>%
  rowwise() %>%
  mutate(Total = sum(c_across(all_of(ahcdm_response_col)))) %>%
  ungroup()
# Calculate percentages
for (col in ahcdm_response_col) {
  df_ahcdm[[paste0(col, " %")]] <- 100 * df_ahcdm[[col]] / df_ahcdm$Total
}
# Reshape to long format for plotting
df_ahcdm_long <- df_ahcdm %>%
  select(Country, ends_with("%")) %>%
  pivot_longer(
    cols = -Country,
    names_to = "Response",
    values_to = "Percent"
  ) %>%
  mutate(Response = gsub(" %", "", Response)) # Clean up response label
# Generate interactive plot using plotly
# (orientation = "h" matches the other two charts: countries on the y axis)
fig_ahcdm <- plot_ly(
  df_ahcdm_long,
  y = ~Country,
  x = ~Percent,
  color = ~Response,
  type = "bar",
  orientation = "h"
) %>%
  layout(
    barmode = "stack",
    title = "Percentage of Women Who Have the Final Say on Their Health Care, by Country",
    xaxis = list(title = " "),
    yaxis = list(title = " "),
    legend = list(title = list(text = "Response Category"))
  )
fig_ahcdm
#save_plotly_screenshot(fig_ahcdm, "fig_ahcdm.png")
#knitr::include_graphics("fig4_ahcdm.png")
#knitr::include_graphics("fig5_ahcdm.png")
The chart shows women’s reported autonomy and roles in health care
decision-making by country and response category
(ahcdm_response_col). In many countries, the largest proportion
of women say decisions are made “with their husband/partner” or by their
“husband/partner” alone, reflecting persistent gender norms around
health autonomy. However, countries such as Mozambique, Lesotho, and
Madagascar display higher shares for “Woman alone,” indicating stronger
female decision-making autonomy. “Woman and someone else” and “Family
elders/relatives” are minor categories in most contexts, suggesting
these are less common arrangements for household health decisions.
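The three analyses above repeat the same pattern: sum the response counts per country, convert them to percentages, and reshape to long format. As a possible refactoring, that pattern could be wrapped in one helper. The sketch below is only a suggestion; `to_percent_long()` is a hypothetical name and the toy counts are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: turn a table of counts (one row per country, one
# column per response category) into long-format row percentages.
to_percent_long <- function(df, response_cols) {
  df %>%
    select(Country, all_of(response_cols)) %>%
    pivot_longer(cols = all_of(response_cols),
                 names_to = "Response", values_to = "Count") %>%
    group_by(Country) %>%
    mutate(Percent = 100 * Count / sum(Count)) %>%
    ungroup() %>%
    select(Country, Response, Percent)
}

# Toy counts, made up for illustration only.
toy <- tibble(Country = c("A", "B"), no = c(30, 50), yes = c(10, 50))
toy_long <- to_percent_long(toy, c("no", "yes"))
print(toy_long)
```

With a helper of this kind, each indicator would need only its own vector of response categories, and the denominator rule (response categories only) would be defined in a single place.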
Output files, including datasets, analyses and results, are saved to a OneDrive folder with the following structure:
DHS-Download Task
├── [DHS_Downloads]
└── [Downloads report, metadata, log]
[Gender Inequalities]
├── [DHS]
│   ├── [dhs-ir-piolt-merge-KE8_TZ8]
│   └── [planning-and-var-map]
└── [IPUMS]
    ├── [ipums-analysis]
    │   ├── [r-project-files-exec-report]
    │   └── [spss-analysis]
    ├── [ipums-data-extracts-comd-files]
    ├── [ipums-ir-dataset]
    └── [ipums-planning-and-var-map]
The DHS Program. (2025). Sustainable Development Goals. https://dhsprogram.com/topics/sdgs/index.cfm (Accessed August 28, 2025)
The DHS Program. (2025). Merging datasets. https://dhsprogram.com/data/Merging-datasets.cfm (Accessed September 1, 2025)
Wessells, M. G., & Kostelny, K. (2022). The psychosocial impacts of intimate partner violence against women in LMIC contexts: Toward a holistic approach. International Journal of Environmental Research and Public Health, 19(21), 14488. https://doi.org/10.3390/ijerph192114488
* SPSS Version 30.0.0.0(172).
* Encoding: UTF-8.
* Check the Recode file to confirm that variable names and contexts match. For this pilot merge, KEIR8CFL.SAV and TZIR82FL.SAV were collected in the same year and survey phase (1st survey conducted in DHS Phase 8, in 2022).
* KEIR8CFL.SAV, however, is a continuous DHS dataset. Create a copy of each original dataset, as these changes will overwrite the original dataset unless otherwise specified, as in Step 2.
* Step 1: Create a unique ID from the V000 and CASEID variables in both files.
* Unique ID for Kenya; Dataset 1 (KEIR8CFL.SAV).
DATASET ACTIVATE DataSet1.
STRING UCASEID (A20).
COMPUTE UCASEID=CONCAT(V000,CASEID).
VARIABLE LABELS UCASEID 'Unique Case ID'.
EXECUTE.
*Unique ID for Tanzania; Dataset 2( TZIR82FL.SAV ).
DATASET ACTIVATE DataSet2.
STRING UCASEID (A20).
COMPUTE UCASEID=CONCAT(V000,CASEID).
VARIABLE LABELS UCASEID 'Unique Case ID'.
EXECUTE.
*Step 2: Select the unique case ID along with the IPV variables from both datasets for merging. Save them under a different name. Modify the file path as needed.
DATASET ACTIVATE DataSet1.
SAVE OUTFILE='C:\Users\Desktop\_KEIR8CFL.SAV'
/KEEP UCASEID V000 V001 V003 V004 V005 V006 V007 G100 G101 G102 G103 G104 G105 G107.
DATASET ACTIVATE DataSet2.
SAVE OUTFILE='C:\Users\Desktop\_TZIR82FL.SAV'
/KEEP UCASEID V000 V001 V003 V004 V005 V006 V007 G100 G101 G102 G103 G104 G105 G107.
*Open _KEIR8CFL.SAV and _TZIR82FL.SAV as Datasets 3 and 4 respectively.
*Step 3: Merge all variables.
DATASET ACTIVATE DataSet3.
ADD FILES /FILE=*
/FILE='DataSet4'.
EXECUTE.
*By default, the active dataset (Dataset3 _KEIR8CFL.SAV) is modified to contain the merged cases from the other dataset (Dataset4 _TZIR82FL.SAV).
SAVE OUTFILE='C:\Users\Desktop\KE8-TZ8-ir-ipv.SAV'
/COMPRESSED.
* Encoding: UTF-8.
* Version 29.0.2.0 (20).
Naming conventions for cross-tabulation results used in further analysis:
1. ipv: percentage of women slapped in the last 12 months (frequency), variable code = DVPSLAPFQ
2. fgm: percentage of ever-circumcised women within country, variable code = FCCIRC
3. ahcdm: percentage of women who have the final say on their health care within country, variable code = DECFEMHCARE
*Load dataset.
GET
FILE='C:\Users\Desktop\ipums-ir-dataset.sav'.
DATASET ACTIVATE DataSet1.
CROSSTABS
/TABLES=Country BY DVPSLAPFQ
/FORMAT=AVALUE TABLES
/CELLS=COUNT ROW COLUMN
/COUNT ROUND CELL.
CROSSTABS
/TABLES= Country BY FCCIRC
/FORMAT=AVALUE TABLES
/CELLS=COLUMN
/COUNT ROUND CELL.
CROSSTABS
/TABLES=Country BY DECFEMHCARE
/FORMAT=AVALUE TABLES
/CELLS=COUNT ROW COLUMN
/COUNT ROUND CELL.
*-----------------------------------------------------------------------.
*For data cleaning in R:
1. Set row 3 as the header.
2. Remove:
   rows: 1-2, last
   cols: 1, last
3. Rename col 2: Country
4. Filter and remove cols:
   - Missing
   - Not in Universe (NIU)
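The cleaning steps listed above can be sketched in base R on a toy table. This is an illustration only: the layout of `raw`, the helper name `clean_crosstab()` and the column labels passed to `drop_cols` are assumptions, and the real SPSS export may use different labels (for example a longer "not in universe" label).

```r
# Toy stand-in for a raw SPSS crosstab sheet read without column names
# (layout and values are assumptions for illustration).
raw <- data.frame(
  X1 = c("Title", NA, NA, NA, NA, NA),
  X2 = c(NA, NA, "Country", "A", "B", "Total"),
  X3 = c(NA, NA, "no", "30", "50", "80"),
  X4 = c(NA, NA, "yes", "10", "50", "60"),
  X5 = c(NA, NA, "Missing", "1", "2", "3"),
  X6 = c(NA, NA, "Total", "41", "102", "143"),
  stringsAsFactors = FALSE
)

# Hypothetical helper following the numbered steps in the notes.
clean_crosstab <- function(raw, drop_cols = c("Missing", "NIU")) {
  x <- raw[, -c(1, ncol(raw)), drop = FALSE]   # step 2: drop first and last columns
  header <- as.character(unlist(x[3, ]))       # step 1: row 3 holds the header
  x <- x[-c(1:3, nrow(x)), , drop = FALSE]     # step 2: drop rows 1-2, the header row, and the last row
  names(x) <- header
  names(x)[1] <- "Country"                     # step 3: name the country column
  x <- x[, !(names(x) %in% drop_cols), drop = FALSE]  # step 4: drop Missing / NIU
  x[-1] <- lapply(x[-1], as.numeric)           # counts as numbers
  rownames(x) <- NULL
  x
}

cleaned <- clean_crosstab(raw)
print(cleaned)
```

Dropping columns by label rather than by position keeps the step valid when a variable has a different number of response categories.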
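Abbreviations
- DHS: Demographic and Health Surveys
- IPUMS: Integrated Public Use Microdata Series
- USAID: United States Agency for International Development
- LMICs: low- and middle-income countries
- All IPUMS variable abbreviations are available on IPUMS-DHS.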